Feat/backend #4
Pull request overview
This pull request implements a comprehensive V2 backend refactoring for the Technology Trend Analysis Platform, transitioning from a monolithic CSV-only pipeline to a modular, serverless data stack with enhanced quality controls, dual-write capabilities, and frontend bridge support. The changes span 59 files with significant architectural improvements while maintaining backward compatibility.
Changes:
- Modularized ETL pipeline with parallel GitHub Actions jobs (GitHub, StackOverflow, Reddit) using artifact-based handoff and aggregation
- Implemented dual-write storage strategy supporting legacy CSV, latest snapshots, and date-partitioned history with configurable environment flags
- Added severity-based quality gate system with Pandera integration supporting critical/warning/info levels and degradation policies for partial source failures
- Introduced DuckDB-based Trend Score engine with equivalence tests validating numeric parity with legacy pandas implementation
- Created data product contract system with run/dataset manifests, SemVer versioning, and deterministic schema hashing
- Implemented frontend bridge JSON export for historical trend data with feature flag-based partial cutover and CSV fallback
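To make the "deterministic schema hashing" item more concrete, here is a minimal sketch of one way such a hash could be computed. The function name `compute_schema_hash` and the column-to-dtype input shape are assumptions for illustration, not the actual utilities in backend/config/schema_contract_utils.py; the only detail taken from this PR is that the hash is a 64-character SHA-256 hex digest.

```python
import hashlib
import json


def compute_schema_hash(columns: dict[str, str]) -> str:
    """Return a SHA-256 hex digest derived from column names and dtypes.

    Hypothetical helper: sorting the keys and fixing the JSON separators
    gives the same digest regardless of dict insertion order.
    """
    canonical = json.dumps(
        columns, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same schema declared in a different column order yields the same hash.
first = compute_schema_hash({"repo": "string", "stars": "int64"})
second = compute_schema_hash({"stars": "int64", "repo": "string"})
assert first == second
assert len(first) == 64
```

Treating column order as irrelevant is a design choice of this sketch; if column order is part of the contract, the columns could be hashed as an ordered list instead.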
Reviewed changes
Copilot reviewed 44 out of 45 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| .github/workflows/etl_semanal.yml | Refactored to parallel job architecture with artifact validation and conditional publishing |
| backend/trend_score.py | Added engine selector supporting legacy pandas and DuckDB implementations |
| backend/trend_score_duckdb.py | New DuckDB-based SQL engine for trend score computation |
| backend/validador.py | Enhanced with Pandera quality checks and severity-based issue routing |
| backend/validate_csv_contract.py | Updated with Pandera integration and configurable validation modes |
| backend/quality/pandera_schemas.py | New module defining dataset schemas and multi-severity quality rules |
| backend/quality/degradation_policy.py | New module implementing source availability degradation matrix |
| backend/config/data_product_contract.py | New contract defining run and dataset manifest structures with validation |
| backend/config/schema_contract_utils.py | New utilities for deterministic schema hashing and SemVer bump recommendations |
| backend/sync_assets.py | Enhanced with latest/legacy prioritization and bridge JSON export integration |
| backend/export_history_json.py | New module generating frontend bridge JSON from history snapshots |
| backend/base_etl.py | Updated with dual-write support for legacy/latest/history destinations |
| backend/config/settings.py | Added write strategy flags and path resolution utilities |
| frontend/lib/services/csv_service.dart | Enhanced with bridge JSON loading and automatic CSV fallback |
| frontend/lib/config/feature_flags.dart | New feature flag system for controlled bridge JSON cutover |
| frontend/lib/screens/home_screen.dart | Added temporal trend view card demonstrating bridge integration |
| tests/* | Comprehensive test coverage for new modules with 133 passing tests |
| docs/* | Updated architecture, contracts, and implementation roadmap documentation |
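The write strategy flags and the engine selector listed above are driven by environment variables. The sketch below shows one plausible way to read them; the variable names DATA_WRITE_LEGACY_CSV and TREND_SCORE_ENGINE come from the PR description, while the helper names, the set of truthy strings, the defaults, and the accepted engine values ("legacy", "duckdb") are assumptions rather than the actual contents of backend/config/settings.py.

```python
import os
from typing import Mapping, Optional

# Assumed set of strings treated as "enabled"; the real parser may differ.
TRUTHY = {"1", "true", "yes", "on"}
# Assumed engine names, based on the "legacy pandas and DuckDB" description.
KNOWN_ENGINES = {"legacy", "duckdb"}


def env_flag(name: str, default: bool = False,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def select_engine(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the trend score engine name, rejecting unknown values."""
    env = os.environ if env is None else env
    engine = env.get("TREND_SCORE_ENGINE", "legacy").strip().lower()
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown TREND_SCORE_ENGINE: {engine!r}")
    return engine


# Example: decide at runtime whether to keep writing the legacy CSV.
write_legacy_csv = env_flag("DATA_WRITE_LEGACY_CSV", default=True)
engine = select_engine()
```

Passing `env` explicitly keeps the helpers testable without mutating the process environment.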
@@ -1,47 +1,61 @@
# Política mínima de dependencias y seguridad
# Politica de Dependencias y Seguridad
The BOM (Byte Order Mark) character \ufeff is present at the beginning of several documentation files. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM from these files for better compatibility.
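For reference, a BOM can be removed without touching the rest of the file by operating on raw bytes. The following is a minimal sketch; the helper name `strip_utf8_bom` is hypothetical and not part of this PR.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from a file.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file without the three BOM bytes (EF BB BF).
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

It could be applied to the documentation tree with something like `for md in Path("docs").rglob("*.md"): strip_utf8_bom(md)`, assuming the affected files live under docs/.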
@@ -1,85 +1,113 @@
# Contrato de datos CSV (Backend ↔ Frontend)
# Contrato de Datos (Backend <-> Frontend)
The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.
        return ["dataset manifest debe ser un objeto (dict/mapping)"]

    for field in DATASET_REQUIRED_FIELDS:
        if field not in dataset_manifest:
            errors.append(f"falta campo requerido '{field}'")

    dataset_name = dataset_manifest.get("dataset_logical_name")
    if "dataset_logical_name" in dataset_manifest and not _is_non_empty_string(dataset_name):
        errors.append("'dataset_logical_name' debe ser string no vacio")

    version_semver = dataset_manifest.get("version_semver")
    if "version_semver" in dataset_manifest and not is_valid_semver(version_semver):
        errors.append("'version_semver' no cumple SemVer")

    generated_at_utc = dataset_manifest.get("generated_at_utc")
    if "generated_at_utc" in dataset_manifest and not is_valid_iso_utc(generated_at_utc):
        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")

    source_run_id = dataset_manifest.get("source_run_id")
    if "source_run_id" in dataset_manifest and not _is_non_empty_string(source_run_id):
        errors.append("'source_run_id' debe ser string no vacio")
    if expected_run_id and source_run_id != expected_run_id:
        errors.append("'source_run_id' no coincide con run_id del manifest principal")

    schema_hash = dataset_manifest.get("schema_hash")
    if "schema_hash" in dataset_manifest:
        if not _is_non_empty_string(schema_hash) or _HEX64_RE.fullmatch(schema_hash.strip()) is None:
            errors.append("'schema_hash' debe ser hash sha256 en hexadecimal (64 chars)")

    row_count = dataset_manifest.get("row_count")
    if "row_count" in dataset_manifest:
        if not isinstance(row_count, int):
            errors.append("'row_count' debe ser integer")
        elif row_count < 0:
            errors.append("'row_count' no puede ser negativo")

    quality_status = dataset_manifest.get("quality_status")
    if "quality_status" in dataset_manifest and quality_status not in DATASET_QUALITY_STATUSES:
        errors.append(f"'quality_status' invalido: {quality_status}")

    latest_path = dataset_manifest.get("latest_path")
    if "latest_path" in dataset_manifest and not _is_non_empty_string(latest_path):
        errors.append("'latest_path' debe ser string no vacio")

    history_path = dataset_manifest.get("history_path")
    if "history_path" in dataset_manifest:
        if quality_status == "fail":
            if history_path is not None and not _is_non_empty_string(history_path):
                errors.append("'history_path' debe ser null o string no vacio cuando quality_status=fail")
        elif not _is_non_empty_string(history_path):
            errors.append("'history_path' debe ser string no vacio")

    return errors


def validate_run_manifest(run_manifest: Mapping[str, Any]) -> tuple[bool, list[str]]:
    """Validates minimal structure and rules for a run manifest."""
    errors: list[str] = []

    if not isinstance(run_manifest, Mapping):
        return False, ["run manifest debe ser un objeto (dict/mapping)"]

    for field in RUN_REQUIRED_FIELDS:
        if field not in run_manifest:
            errors.append(f"falta campo requerido '{field}'")

    run_id = run_manifest.get("run_id")
    if "run_id" in run_manifest and not _is_non_empty_string(run_id):
        errors.append("'run_id' debe ser string no vacio")

    generated_at_utc = run_manifest.get("generated_at_utc")
    if "generated_at_utc" in run_manifest and not is_valid_iso_utc(generated_at_utc):
        errors.append("'generated_at_utc' no es ISO-8601 valido con zona horaria")

    for field in ("source_window_start_utc", "source_window_end_utc"):
        value = run_manifest.get(field)
        if field in run_manifest and not is_valid_iso_utc(value):
            errors.append(f"'{field}' no es ISO-8601 valido con zona horaria")

    quality_gate_status = run_manifest.get("quality_gate_status")
    if "quality_gate_status" in run_manifest and quality_gate_status not in QUALITY_GATE_STATUSES:
        errors.append(f"'quality_gate_status' invalido: {quality_gate_status}")

    for field in ("git_sha", "branch"):
        value = run_manifest.get(field)
        if field in run_manifest and not _is_non_empty_string(value):
            errors.append(f"'{field}' debe ser string no vacio")

    datasets = run_manifest.get("datasets")
    if "datasets" in run_manifest:
        if not isinstance(datasets, list):
            errors.append("'datasets' debe ser lista")
        elif not datasets:
            errors.append("'datasets' no puede estar vacio")
        else:
            for index, dataset_manifest in enumerate(datasets):
                dataset_errors = validate_dataset_manifest(
                    dataset_manifest,
                    expected_run_id=run_id if _is_non_empty_string(run_id) else None,
                )
                errors.extend(f"datasets[{index}]: {message}" for message in dataset_errors)
Multiple error messages in this file are in Spanish (e.g., 'dataset manifest debe ser un objeto', 'falta campo requerido', 'debe ser string no vacio', etc.). According to the coding style guide at docs/coding_style.md, backend modules should use English for comments and docstrings. Error messages should also follow this convention for consistency across the codebase. Consider translating these error messages to English.
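As an illustration of what the translation could look like, the sketch below restates the row_count checks from the diff with English messages. The standalone function name `validate_row_count` and the exact English wording are assumptions; in the PR these checks live inline in validate_dataset_manifest.

```python
from typing import Any, Mapping


def validate_row_count(dataset_manifest: Mapping[str, Any]) -> list[str]:
    """Hypothetical excerpt: the row_count rules with English messages."""
    errors: list[str] = []
    row_count = dataset_manifest.get("row_count")
    if "row_count" in dataset_manifest:
        if not isinstance(row_count, int):
            # Was: "'row_count' debe ser integer"
            errors.append("'row_count' must be an integer")
        elif row_count < 0:
            # Was: "'row_count' no puede ser negativo"
            errors.append("'row_count' must not be negative")
    return errors
```

The validation logic is unchanged; only the message strings differ, so existing tests that assert on message text would need the same update.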
@@ -0,0 +1,62 @@
# Estandar de Estilo del Repositorio
The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.
@@ -1,109 +1,96 @@
# Architecture -- Technology Trend Analysis Platform
# Arquitectura del Proyecto
The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.
│ Flutter Web Dashboard │
│ 4 views · fl_chart · Export ZIP · Responsive │
└─────────────────────────────────────────────────────┘
# Technology Trend Analysis Platform
The BOM (Byte Order Mark) character \ufeff is present at the beginning of this file. While this doesn't break functionality, it can cause issues with some text processing tools and is generally considered unnecessary for UTF-8 files. Consider removing the BOM for better compatibility.
This pull request significantly restructures and enhances the weekly ETL pipeline workflow, improves environment configuration options, and updates CI/CD triggers and dependency audit handling. The main focus is on modularizing ETL jobs by data source, improving artifact management, and adding robust validation and publishing steps. Additionally, several environment variables and workflow triggers have been updated for better flexibility and reliability.
ETL Pipeline Refactor and Enhancement:
- .github/workflows/etl_semanal.yml workflow is fully modularized: each data source (GitHub, StackOverflow, Reddit) now runs in its own job, with artifacts uploaded and aggregated in a dedicated aggregation job. This improves parallelism, error isolation, and maintainability. [1] [2] [3]
Environment and Configuration Improvements:
- [...] .env.example and set in the ETL workflow, enabling more flexible and transparent configuration. [1] [2]
CI/CD Workflow Updates:
- [...] main, feat/backend, feat/frontend), ensuring checks are run for active development streams. [1] [2]
- .github/workflows/dependency_security.yml now ignores a known NLTK vulnerability (CVE-2025-14009) until a fix is available, preventing unnecessary pipeline failures.
Deployment Workflow Safeguard:
- [...] main branch, reducing the risk of unintended deployments.
Most important changes:
ETL Pipeline Modularization and Validation
- [...] .github/workflows/etl_semanal.yml to run ETL jobs for GitHub, StackOverflow, and Reddit as separate jobs, each uploading its own artifacts, followed by an aggregation job that validates outputs, runs quality gates, and uploads aggregate artifacts for publishing. [1] [2] [3]
Configuration and Environment
- [...] .env.example and set them in the ETL workflow for data write strategies (DATA_WRITE_LEGACY_CSV, etc.) and the trend score engine selector (TREND_SCORE_ENGINE), allowing for granular control of ETL outputs. [1] [2]
CI/CD and Audit Workflow Improvements
- [...] feat/backend and feat/frontend branches, ensuring active feature branches are tested and audited. [1] [2]
Deployment Workflow Safeguard
- [...] main branch, reducing accidental deployments from other branches.
User-facing Improvements